Diacritics correction in Turkish with context-aware sequence to sequence modeling
Abstract
Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate the meaning of words. Diacritic restoration is therefore a crucial step for these languages. In this study we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritic restoration. We propose a context-aware character-level sequence-to-sequence model for this transformation. The model is language independent in the sense that no language-specific feature extraction is necessary other than the utilization of word embeddings, so it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task, and for assessment we used a Turkish tweets benchmark dataset. The best setting of the proposed model improves the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and by 1.24% over all cases.
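To make the ambiguity concrete, the following is a minimal Python sketch of the letter-level mapping between Turkish diacritical letters and their ASCII counterparts. It is not the model from the paper: the names `asciify` and `candidates` and the mapping tables are illustrative assumptions. It only enumerates the forms a single word could take, which is the search space a context-aware model would have to choose from.

```python
from itertools import product

# Illustrative mapping (an assumption, not taken from the paper):
# Turkish diacritical letters and their ASCII counterparts.
TR_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

# Reverse direction, lower-case letters only, to keep the sketch short.
ASCII_TO_TR = {"c": "ç", "g": "ğ", "i": "ı", "o": "ö", "s": "ş", "u": "ü"}


def asciify(text: str) -> str:
    """Replace every Turkish diacritical letter with its ASCII counterpart."""
    return text.translate(TR_TO_ASCII)


def candidates(word: str) -> list[str]:
    """Enumerate every diacritic variant of a lower-case word.

    The word is folded to ASCII first, so a misused diacritic in the
    input does not hide the correct form (the bidirectional view).
    """
    folded = asciify(word)
    options = [
        (ch, ASCII_TO_TR[ch]) if ch in ASCII_TO_TR else (ch,)
        for ch in folded
    ]
    return ["".join(chars) for chars in product(*options)]


if __name__ == "__main__":
    print(asciify("açı"))
    print(candidates("aci"))
```

For the input `aci`, the candidates include both `acı` ("pain", "bitter") and `açı` ("angle"), which are distinct Turkish words; choosing between them requires sentence context. The number of candidates doubles with each ambiguous letter, so exhaustive enumeration is only practical for short words.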
Similar resources
Effects of diacritics on Turkish information retrieval
We investigate the effects of improper use of diacritics in the Turkish alphabet on information retrieval. A diacritic is simply a supplementary sign added to a letter to change the sound value of the letter, and the Turkish alphabet has 5 special letters derived from Latin by adding different diacritics. The statistical analysis performed in this study shows that retrieval performance signific...
CAPS: Context Aware Personalized POI Sequence Recommender System
The revolution in World Wide Web (WWW) and smartphone technologies has been the key factor behind the remarkable success of social networks. With the ready availability of check-in data, location-based social networks (LBSN) (e.g., Facebook, etc.) have been heavily explored in the past decade for Point-of-Interest (POI) recommendation. Though many POI recommenders have been defined, most of...
Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction
We demonstrate that an attention-based encoder-decoder model can be used for sentence-level grammatical error identification for the Automated Evaluation of Scientific Writing (AESW) Shared Task 2016. The attention-based encoder-decoder models can be used for the generation of corrections, in addition to error identification, which is of interest for certain end-user applications. We show that ...
Sequence to Sequence Modeling for User Simulation in Dialog Systems
User simulators are a principal offline method for training and evaluating human-computer dialog systems. In this paper, we examine simple sequence-to-sequence neural network architectures for training end-to-end, natural language to natural language, user simulators, using only raw logs of previous interactions without any additional human labelling. We compare the neural network-based simulat...
Sequence-Aware Recommender Systems
Characterization. Adopting the formalisms of [3], we can describe the problem at a more formal, abstract level as follows. Let C be a set of users and I a set of recommendable items. In contrast to matrix-completion problems, we are not interested in predicting a utility value for each i ∈ I and for each c ∈ C, but in computing an ordered list of objects L of length k for each user, where each...
Journal
Journal title: Turkish Journal of Electrical Engineering and Computer Sciences
Year: 2022
ISSN: 1300-0632, 1303-6203
DOI: https://doi.org/10.55730/1300-0632.3948